This project explores a dataset of Ford GoBike trip data from the greater San Francisco Bay Area.
# Import the dependencies
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import calendar
from datetime import datetime, date
import plotly.express as px
%matplotlib inline
# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")
df = pd.read_csv("201902-fordgobike-tripdata.csv")
# checking the shape of the data
print("The dataset has {} columns and {} rows".format(df.shape[1],df.shape[0]))
print("There are {} duplicated rows in the data".format(df.duplicated().sum()))
The dataset has 16 columns and 183412 rows
There are 0 duplicated rows in the data
df.head()
| | duration_sec | start_time | end_time | start_station_id | start_station_name | start_station_latitude | start_station_longitude | end_station_id | end_station_name | end_station_latitude | end_station_longitude | bike_id | user_type | member_birth_year | member_gender | bike_share_for_all_trip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52185 | 2019-02-28 17:32:10.1450 | 2019-03-01 08:01:55.9750 | 21.0 | Montgomery St BART Station (Market St at 2nd St) | 37.789625 | -122.400811 | 13.0 | Commercial St at Montgomery St | 37.794231 | -122.402923 | 4902 | Customer | 1984.0 | Male | No |
| 1 | 42521 | 2019-02-28 18:53:21.7890 | 2019-03-01 06:42:03.0560 | 23.0 | The Embarcadero at Steuart St | 37.791464 | -122.391034 | 81.0 | Berry St at 4th St | 37.775880 | -122.393170 | 2535 | Customer | NaN | NaN | No |
| 2 | 61854 | 2019-02-28 12:13:13.2180 | 2019-03-01 05:24:08.1460 | 86.0 | Market St at Dolores St | 37.769305 | -122.426826 | 3.0 | Powell St BART Station (Market St at 4th St) | 37.786375 | -122.404904 | 5905 | Customer | 1972.0 | Male | No |
| 3 | 36490 | 2019-02-28 17:54:26.0100 | 2019-03-01 04:02:36.8420 | 375.0 | Grove St at Masonic Ave | 37.774836 | -122.446546 | 70.0 | Central Ave at Fell St | 37.773311 | -122.444293 | 6638 | Subscriber | 1989.0 | Other | No |
| 4 | 1585 | 2019-02-28 23:54:18.5490 | 2019-03-01 00:20:44.0740 | 7.0 | Frank H Ogawa Plaza | 37.804562 | -122.271738 | 222.0 | 10th Ave at E 15th St | 37.792714 | -122.248780 | 4898 | Subscriber | 1974.0 | Male | Yes |
There are 183412 Ford GoBike trips in the dataset with 16 fields (duration_sec, start_time, end_time, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, member_birth_year, member_gender, bike_share_for_all_trip).
My main interest is trip duration and how it relates to the other fields in the dataset.
I expect trip duration to depend on the day of the week, as more bike trips are expected on weekends than on weekdays. Similarly, I expect males to take more and longer trips than females, on the assumption that biking requires strength.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             183412 non-null  int64
 1   start_time               183412 non-null  object
 2   end_time                 183412 non-null  object
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64
 12  user_type                183412 non-null  object
 13  member_birth_year        175147 non-null  float64
 14  member_gender            175147 non-null  object
 15  bike_share_for_all_trip  183412 non-null  object
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB
print(df.isnull().sum()) # checking for null records
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64
# Drop NaN
df.dropna(inplace = True)
# Plotting the distribution of all numeric columns
df.hist(figsize= (10,10));
#changing data type of start_time and end_time to datetime.
df.start_time = pd.to_datetime(df.start_time)
df.end_time = pd.to_datetime(df.end_time)
# Extracting the month from start_time and end_time
month_start = df['start_time'].dt.month
month_end = df['end_time'].dt.month
df['start_month'] = month_start.apply(lambda x: calendar.month_abbr[x])
df['end_month'] = month_end.apply(lambda x: calendar.month_abbr[x])
# Extracting day of the week from start_time and end_time
day = df['start_time'].apply(lambda time: time.dayofweek)
day1 = df['end_time'].apply(lambda time: time.dayofweek)
days_week = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
df['start_day'] = day.map(days_week)
df['end_day'] = day1.map(days_week)
# Convert member_birth_year to integer
df['member_birth_year'] = df['member_birth_year'].astype(int)
# Extracting Age from member_birth_year
today = date.today()
df['age'] = today.year - df['member_birth_year']
df.start_month.value_counts()
Feb    174952
Name: start_month, dtype: int64
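The month, weekday and hour can also be pulled out with the vectorized `.dt` accessor instead of `apply`. The snippet below is only a sketch of that alternative, not the code used for the results in this notebook; `sample` is a hypothetical three-row frame whose timestamps are made up for illustration.

```python
import pandas as pd

# Hypothetical three-row sample standing in for the trip data
sample = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-02-28 17:32:10",
        "2019-02-23 09:05:00",
        "2019-03-01 00:20:44",
    ])
})

# month_name() and day_name() return English names by default;
# the first three letters give the same abbreviations used above
sample["start_month"] = sample["start_time"].dt.month_name().str[:3]
sample["start_day"] = sample["start_time"].dt.day_name().str[:3]
sample["start_hour"] = sample["start_time"].dt.hour

print(sample)
```

On the full dataset this avoids a Python-level function call per row, which is usually faster than `apply` with a lambda.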
# b. Getting the time of the day the trip started
df['start_hour'] = df['start_time'].apply(lambda time: time.hour)
# hours 0-11 keep the default label 'morning'
df['period_day'] = 'morning'
df.loc[(df['start_hour'] >= 12) & (df['start_hour'] <= 17), 'period_day'] = 'afternoon'
df.loc[(df['start_hour'] >= 18) & (df['start_hour'] <= 23), 'period_day'] = 'night'
# convert time period and weekday into ordered categorical types
time_dict = {'period_day': ['morning', 'afternoon', 'night'],
'start_day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'],
'end_day': ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']}
for var in time_dict:
    ordered_var = pd.api.types.CategoricalDtype(ordered = True,
                                                categories = time_dict[var])
    df[var] = df[var].astype(ordered_var)
df.describe()
| | duration_sec | start_station_id | start_station_latitude | start_station_longitude | end_station_id | end_station_latitude | end_station_longitude | bike_id | member_birth_year | age | start_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 | 174952.000000 |
| mean | 704.002744 | 139.002126 | 37.771220 | -122.351760 | 136.604486 | 37.771414 | -122.351335 | 4482.587555 | 1984.803135 | 37.196865 | 13.456165 |
| std | 1642.204905 | 111.648819 | 0.100391 | 0.117732 | 111.335635 | 0.100295 | 0.117294 | 1659.195937 | 10.118731 | 10.118731 | 4.734282 |
| min | 61.000000 | 3.000000 | 37.317298 | -122.453704 | 3.000000 | 37.317298 | -122.453704 | 11.000000 | 1878.000000 | 21.000000 | 0.000000 |
| 25% | 323.000000 | 47.000000 | 37.770407 | -122.411901 | 44.000000 | 37.770407 | -122.411647 | 3799.000000 | 1980.000000 | 30.000000 | 9.000000 |
| 50% | 510.000000 | 104.000000 | 37.780760 | -122.398279 | 101.000000 | 37.781010 | -122.397437 | 4960.000000 | 1987.000000 | 35.000000 | 14.000000 |
| 75% | 789.000000 | 239.000000 | 37.797320 | -122.283093 | 238.000000 | 37.797673 | -122.286533 | 5505.000000 | 1992.000000 | 42.000000 | 17.000000 |
| max | 84548.000000 | 398.000000 | 37.880222 | -121.874119 | 398.000000 | 37.880222 | -121.874119 | 6645.000000 | 2001.000000 | 144.000000 | 23.000000 |
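In the summary above, a minimum age of 21 alongside a latest birth year of 2001 indicates that ages were measured against the year the notebook was run (2022), not the year of the trips. The snippet below is a minimal sketch, assuming that age at the time of the trip is the quantity of interest. `sample` and `age_at_trip` are illustrative names, and the two rows reuse start dates and birth years from the preview of the data.

```python
import pandas as pd

# Two illustrative rows taken from the preview of the data
sample = pd.DataFrame({
    "start_time": pd.to_datetime(["2019-02-28 17:32:10", "2019-02-28 12:13:13"]),
    "member_birth_year": [1984, 1972],
})

# Age relative to the year of the trip, so the value does not change
# with the date on which the notebook is re-run
sample["age_at_trip"] = sample["start_time"].dt.year - sample["member_birth_year"]

print(sample[["member_birth_year", "age_at_trip"]])
```

Computed this way, every age would be three years lower than the values reported in this notebook, while the shape of the age distribution would stay the same.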
# Showing the locations of the stations on the map
fig = px.scatter_mapbox(
    df,  # Our DataFrame
    lat='start_station_latitude',
    lon='start_station_longitude',
    width=600,  # Width of map
    height=600,  # Height of map
    hover_data=["duration_sec"],  # Display duration when hovering over a station
)
fig.update_layout(mapbox_style="open-street-map")
fig.show()
I start by exploring the duration of trips, the age of riders, and the number of trips by day, period and station.
# 1. Duration by seconds
binsize = 500
bins = np.arange(0, df['duration_sec'].max()+binsize, binsize)
plt.figure(figsize=[8, 6])
plt.hist(data = df, x = 'duration_sec', bins = bins)
plt.title('Distribution of Trip Durations')
plt.xlabel('Duration (sec)')
plt.ylabel('Number of Trips')
plt.axis([-500, 10000, 0, 90000])
plt.show()
The plot doesn't show the real distribution, as it is dominated by the very low and very high values in the data.
# 2. Plot on a log scale to handle the outliers
log_binsize = 0.025
# start at 10**1.7 (about 50 s) so that the shortest trips (61 s) are not dropped
log_bins = 10 ** np.arange(1.7, np.log10(df['duration_sec'].max()) + log_binsize, log_binsize)
plt.figure(figsize=[8, 6])
plt.hist(data = df, x = 'duration_sec', bins = log_bins)
plt.title('Distribution of Trip Durations')
plt.xlabel('Duration (sec)')
plt.ylabel('Number of Trips')
plt.xscale('log')
plt.xticks([60,200,350,600,1000,2000,6000,10000,70000], [60,200,350,600,1000,2000,6000,10000,70000])
plt.show()
Most trips last less than 2000 seconds. The distribution is densest between 500 and 1000 seconds, with a slight peak at around 600 seconds. As mentioned before, there are also some trips lasting more than 70000 seconds (outliers).
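With log-spaced bins, the starting exponent decides the shortest duration the histogram can show, so it is worth checking that the first edge sits at or below the minimum. The sketch below uses only the five summary values of duration_sec from the describe() table (minimum, quartiles and maximum), and `start_exp` and `edges` are illustrative names.

```python
import numpy as np

# min, 25%, 50%, 75% and max of duration_sec from the describe() table
durations = np.array([61, 323, 510, 789, 84548])

log_binsize = 0.025
# round the starting exponent down to a multiple of the bin size
start_exp = np.floor(np.log10(durations.min()) / log_binsize) * log_binsize
edges = 10 ** np.arange(start_exp,
                        np.log10(durations.max()) + log_binsize,
                        log_binsize)

# every value should land in a bin if the edges cover the full range
counts, _ = np.histogram(durations, bins=edges)
print(edges[0], edges[-1], counts.sum())
```

The edges are evenly spaced in log10 units, which is what makes the bars look equally wide once the x-axis is switched to a log scale.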
# Distribution of the age of riders
binsize = 5
plt.figure(figsize=[8,6])
bins = np.arange(0, df['age'].max()+binsize, binsize)
plt.hist(data=df, x='age', bins=bins)
plt.xlabel('Age (Year)')
plt.ylabel('Count')
plt.title('Age distribution of Ford GoBike data');
# the age distribution also has a long tail, so a log scale is used
log_binsize = 0.025
bins = 10 ** np.arange(1.2, np.log10(df['age'].max())+log_binsize, log_binsize)
plt.figure(figsize=[8, 6])
plt.hist(data = df, x = 'age', bins = bins)
plt.xscale('log')
plt.xticks([10,20,30,35,40,50,70,90,100], [10,20,30,35,40,50,70,90,100])
plt.xlabel('Age (Year)')
plt.ylabel('Count')
plt.title('Log-Age Distribution of Riders');
Most Ford GoBike riders are relatively young, with the majority between the ages of 30 and 40. There are also riders recorded as being above 100 years old. This is quite surprising, as biking requires some physical strength, and it may point to incorrectly entered birth years.
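One way to see how many records are affected by such extreme ages is to flag them with a boolean mask before deciding whether to keep them. This is a sketch only: the cut-off of 100 years is my assumption, not a rule from the dataset, and the ages are the minimum, quartiles and maximum from the describe() table.

```python
import pandas as pd

# min, 25%, 50%, 75% and max of age from the describe() table
sample = pd.DataFrame({"age": [21, 30, 35, 42, 144]})

max_plausible_age = 100  # assumed threshold, chosen for illustration
implausible = sample["age"] > max_plausible_age

print("Implausible ages:", implausible.sum())

# keep only the rows at or below the threshold
cleaned = sample.loc[~implausible]
print(cleaned)
```

Counting the flagged rows first makes it clear how much data a filter would remove before anything is dropped.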
# View trips by month
df.start_month.value_counts()
df.end_month.value_counts()
Feb    174938
Mar        14
Name: end_month, dtype: int64
Since all the trips in the dataset started in February, I will not analyse by month.
Next, I analyse trips by day of the week.
df.start_day.value_counts()
Thu    33712
Tue    30584
Wed    28426
Fri    27663
Mon    25641
Sun    14512
Sat    14414
Name: start_day, dtype: int64
# Start Days
fig, ax = plt.subplots(nrows=2, figsize = [8,10])
sns.countplot(data = df, x ='start_day', color = 'blue', ax = ax[0])
sns.countplot(data = df, x ='end_day', color = 'blue', ax = ax[1])
ax[0].set_xlabel('Start Day')
ax[0].set_ylabel('Count')
ax[0].set_title('Start Day Distribution')
ax[1].set_xlabel('End Day')
ax[1].set_ylabel('Count')
ax[1].set_title('End Day Distribution')
# annotate each bar with its count
for p in ax[0].patches:
    ax[0].annotate(f'\n{p.get_height()}', (p.get_x()+0.3, p.get_height()), ha='center', va='top', color='black', size=8)
for p in ax[1].patches:
    ax[1].annotate(f'\n{p.get_height()}', (p.get_x()+0.3, p.get_height()), ha='center', va='top', color='black', size=8)
The plots show that most trips started and ended on the same day, as the start-day and end-day counts are nearly identical. Thursday had the highest number of trips for both the start day and the end day.
Monday is the only day with the same number of trips started and ended.
# Proportion of periods
df.period_day.value_counts() / len(df)
morning      0.385340
afternoon    0.383065
night        0.231595
Name: period_day, dtype: float64
# Period of the Day
sns.countplot(data = df, x ='period_day', color = 'blue')
plt.xlabel('Time of the Day')
plt.ylabel('Count')
plt.title('A bar chart of the start period of trips');
The analysis shows that trips start mainly in the morning and afternoon, with little difference between the two periods. Together they account for over 76% of the trips. Fewer trips start at night: only about 23%.
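The three periods correspond to start hours 0-11 (morning), 12-17 (afternoon) and 18-23 (night). As a sketch of an alternative way to build the same labels, `pd.cut` can bin the hours in a single call and returns an ordered categorical directly. The `hours` series below is made up to cover the boundaries of each period.

```python
import pandas as pd

# Hypothetical start hours covering the edges of each period
hours = pd.Series([0, 8, 11, 12, 17, 18, 23], name="start_hour")

# Right-closed bins: (-1, 11] morning, (11, 17] afternoon, (17, 23] night
period = pd.cut(
    hours,
    bins=[-1, 11, 17, 23],
    labels=["morning", "afternoon", "night"],
)

print(period.tolist())
print(period.value_counts(normalize=True))
```

Writing the bin edges in one list keeps the period boundaries visible in one place, which makes it easier to see that the morning period also covers the hours after midnight.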
# Start hour of trip
sns.countplot(data = df, x ='start_hour', color = 'blue')
plt.xlabel('Start Hour')
plt.ylabel('Count')
plt.title('A bar chart of the start hour of trips');
Ride frequency is highest in the morning (7th, 8th and 9th hours) and again from the 16th to the 18th hour.
These peaks may be due to workers commuting to and from work. This corroborates the previous analysis, which showed more trips in the morning and afternoon.
# Plotting top 10 start stations
start_station = df.start_station_name.value_counts()
top_10_start_station = start_station.nlargest(10)
top_10_start_station.plot(kind = 'bar')
plt.xlabel('Stations')
plt.ylabel('Count')
plt.title('Top 10 Stations where trips started');
# Plotting top 10 end stations
end_station = df.end_station_name.value_counts()
top_10_end_station = end_station.nlargest(10)
top_10_end_station.plot(kind = 'bar')
plt.xlabel('Stations')
plt.ylabel('Count')
plt.title('Top 10 Stations where trips ended');
San Francisco Caltrain Station 2, Market St at 10th St and Montgomery St BART Station are the most active stations.
They rank highly among both start stations and end stations.
# Proportion of user_types
df.user_type.value_counts() / len(df)
Subscriber    0.905311
Customer      0.094689
Name: user_type, dtype: float64
# plotting types of users on bar.
plt.bar(x = df.user_type.value_counts().keys(), height = df.user_type.value_counts())
plt.xlabel('User Type')
plt.ylabel('Number of Users')
plt.title('Trips by User_type');
Approximately 91% of Ford GoBike trips were taken by subscribers and only 9% by customers.
# Proportion of gender
df.member_gender.value_counts() / len(df)
Male      0.745919
Female    0.233235
Other     0.020846
Name: member_gender, dtype: float64
# plotting genders on bar.
plt.bar(x = df.member_gender.value_counts().keys(), height = df.member_gender.value_counts())
plt.xlabel('Gender')
plt.ylabel('Number of Users')
plt.title('Trips by Gender');
A small percentage of trips (2%) were taken by riders classified as Other. As expected, most trips were taken by males (75%), against 23% by females.
The trip duration and age distributions were right-skewed, with long tails caused by a few very high values, so most observations were lumped together in a few bins. I used a log transform to visualize the histograms better.
Most of my visualizations across time needed features extracted from the original time columns (start_time and end_time). From these two columns, I extracted the start hour, month, day of the week and period of the day of each trip, which provided further insight into the data.
Similarly, from the member_birth_year column, I calculated the age of Ford GoBike riders.
I noticed that there were riders above 100 years old as well as trips that lasted almost a day.
Next, I explore the relationship between duration and other variables such as age, day of the week, period of the day, gender and user type.
# Visualizing the relationship between age and duration
corr_age_dur = df.duration_sec.corr(df.age)
corr_age_dur
0.006041174875254644
plt.figure(figsize=[8,6])
plt.scatter(df['age'], df['duration_sec'], alpha = 0.25, marker = '.' )
plt.axis([-5, 145, 500, 10500])
plt.xlabel('Age')
plt.ylabel('Duration (sec)')
plt.title('Scatter plot of Duration against Age');
# Plotting the correlation heatmap
correlation1 = df[['age','duration_sec']]
plt.figure(figsize=[8,6])
sns.heatmap(correlation1.corr(), annot=True,fmt = '.3f', cmap = 'vlag_r', center = 0)
plt.title('Correlation Matrix');
A correlation of 0.006 shows no linear correlation between the age of riders and trip duration in seconds. The scatter plot also shows no clear linear relationship between the two variables. This is a surprising insight, as I expected age and duration to be negatively correlated: as age increases, trip duration should decrease.
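The Pearson coefficient only measures linear association and is sensitive to extreme values such as the very long trips seen earlier. A rank-based (Spearman) coefficient is a common cross-check. The sketch below computes it as the Pearson correlation of the ranks, on made-up numbers that include one extreme duration. It is an illustration of the technique, not a result from this dataset.

```python
import pandas as pd

# Hypothetical values: duration rises with age, with one extreme trip
sample = pd.DataFrame({
    "age": [25, 30, 35, 40, 45, 50],
    "duration_sec": [400, 520, 610, 700, 900, 80000],
})

# Pearson on the raw values is pulled around by the extreme trip
pearson = sample["age"].corr(sample["duration_sec"])

# Spearman is the Pearson correlation of the ranks
spearman = sample["age"].rank().corr(sample["duration_sec"].rank())

print(round(pearson, 3), round(spearman, 3))
```

If both coefficients are close to zero on the real data, that supports the conclusion that age and duration are unrelated, not merely that they lack a linear relationship.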
# Duration against day of the week
plt.figure(figsize = [8, 6])
sns.boxplot(data = df, x = 'start_day', y = 'duration_sec', color = 'blue')
plt.xlabel('Day of the Week')
plt.ylabel('Duration')
plt.title('Plot of Duration against Day of the week');
The high variation in the dataset prevents a proper view of the durations for various days of the week.
# Setting a limit of 4000secs for duration in secs
# Duration against day of the week
plt.figure(figsize = [8, 5])
sns.boxplot(data = df, x = 'start_day', y = 'duration_sec', color = 'blue')
plt.ylim([0, 4000])
plt.xlabel('Day of the Week')
plt.ylabel('Duration in sec')
plt.title('Plot of Duration against Day of the week');
The plot suggests that rides take longer on weekends than on weekdays. One reason for this could be that riders use Ford GoBike to commute to work on weekdays, while on weekends they ride more for exercise and leisure.
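The medians that the box plots display can also be read off as numbers with a groupby. The snippet below is a sketch on made-up durations for two days, with one very long Monday trip added to show why the median is a safer summary than the mean for this kind of data.

```python
import pandas as pd

# Hypothetical durations; the 40000 s trip mimics an outlier
sample = pd.DataFrame({
    "start_day": ["Mon", "Mon", "Mon", "Sat", "Sat", "Sat"],
    "duration_sec": [300, 500, 40000, 600, 800, 1000],
})

# Ordered weekday categories, as used elsewhere in this notebook
day_type = pd.CategoricalDtype(
    ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"], ordered=True
)
sample["start_day"] = sample["start_day"].astype(day_type)

# observed=True keeps only the days that actually appear in the data
summary = (
    sample.groupby("start_day", observed=True)["duration_sec"]
    .agg(["median", "mean"])
)
print(summary)
```

In this example the single long trip drags the Monday mean far above the Saturday mean, while the medians still show Saturday trips as longer.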
# Duration against gender type
plt.figure(figsize = [8, 6])
sns.boxplot(data = df, x = 'member_gender', y = 'duration_sec', color = 'blue')
plt.xlabel('Gender')
plt.ylabel('Duration in sec')
plt.title('Plot of Duration against Gender');
#Setting duration to a limit of 4000
# Duration against gender
plt.figure(figsize = [8, 5])
sns.boxplot(data = df, x = 'member_gender', y = 'duration_sec', color = 'blue')
plt.ylim([0, 4000])
plt.xlabel('Gender')
plt.ylabel('Duration in sec')
plt.title('Plot of Duration against Gender');
The duration distributions differ across genders. Both female riders and riders in the Other category take longer trips than male riders.
# Duration against period of the day
plt.figure(figsize = [8, 6])
sns.violinplot(data = df, x = 'period_day', y = 'duration_sec', color = 'blue')
plt.xlabel('Period of the Trip')
plt.ylabel('Duration in sec')
plt.title('Plot of Duration against Period of Day');
# Setting duration to a limit of 4000
# Duration against period of the day
plt.figure(figsize = [8, 5])
sns.violinplot(data = df, x = 'period_day', y = 'duration_sec', color = 'blue')
plt.ylim([0, 4000])
plt.xlabel('Period of the Trip')
plt.ylabel('Duration in sec')
plt.title('Plot of Duration against Period of Day');
There's little difference in trip duration across the starting periods. Trips in the afternoon are slightly longer than in the other periods, but the difference is minimal.
#Setting duration to a limit of 4000
# Duration against user_type
plt.figure(figsize = [8, 5])
sns.boxplot(data = df, x = 'user_type', y = 'duration_sec', color = 'blue')
plt.ylim([0, 4000])
plt.xlabel('User Type')
plt.ylabel('Duration in sec')
plt.title('Plot of Duration against User_type');
# User_type and Period of the day
sns.countplot(data = df, y = 'period_day', hue = 'user_type')
plt.xlabel('Number of Trips')
plt.ylabel('Period of the Trip')
plt.title('Number of Trips by Period and User_type');
The visual above shows that trips taken by subscribers decline as the day progresses: they are highest in the morning, lower in the afternoon and lowest at night. Customer trips, by contrast, peak in the afternoon and then drop at night.
There's no correlation between the duration of a trip and the age of a rider. I expected a slightly negative relationship, since older riders might be expected to take shorter trips, but the result is not all that surprising. The age distribution shows that most riders are between 20 and 50, with little representation from younger and older riders.
Longer trips are more common on weekends than on weekdays.
I expected longer trip durations from male riders. Surprisingly, male riders took shorter trips than female riders and riders in the Other category.
I also noticed that customers took longer trips than subscribers. This could be because customers need the bikes for longer one-off journeys, unlike subscribers, who probably ride an established route.
Question 1 : Understand user_type across duration and day of the trip
# Usertype, Start_day of trip and duration
plt.figure(figsize = [8, 5])
ax = sns.pointplot(data = df, x ='start_day', y = 'duration_sec', hue = 'user_type',
palette = 'Blues', linestyles = '', dodge = 0.4)
plt.title('Trip Duration against User_type and Days of the week')
plt.ylabel('Average Trip Duration (Secs)')
ax.set_yticklabels([],minor = True)
plt.show();
I previously noticed that customers went on longer trips than subscribers, so I checked whether those trips were specific to some days. The visualization shows that they were not. In fact, on every day of the week, customers went on longer trips than subscribers. On average, a customer spends about 1200 seconds on a trip, compared with about 700 seconds for a subscriber. On weekends, however, trips are longer for both user types.
Question 2 : I would like to understand why females and other genders take longer trips than males
# Gender, Start_day of trip and duration
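The averages shown in the point plot can be tabulated with a pivot table, which makes the day-by-day comparison between user types explicit. This is a sketch on made-up durations for two days. The numbers are hypothetical and only loosely echo the averages quoted above.

```python
import pandas as pd

# Hypothetical trips for two days and both user types
sample = pd.DataFrame({
    "start_day": ["Mon", "Mon", "Mon", "Mon", "Sat", "Sat", "Sat", "Sat"],
    "user_type": ["Subscriber", "Subscriber", "Customer", "Customer"] * 2,
    "duration_sec": [600, 700, 1100, 1300, 800, 900, 1500, 1700],
})

# Mean duration for every combination of day and user type
table = sample.pivot_table(
    index="start_day",
    columns="user_type",
    values="duration_sec",
    aggfunc="mean",
)
print(table)

# Difference between the two user types on each day
print(table["Customer"] - table["Subscriber"])
```

A positive difference on every row would confirm that customers take longer trips regardless of the day.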
plt.figure(figsize = [8, 5])
ax = sns.pointplot(data = df, x ='start_day', y = 'duration_sec', hue = 'member_gender',
palette = 'Blues', linestyles = '', dodge = 0.4)
plt.title('Trip Duration against Genders and Days of the week')
plt.ylabel('Average Trip Duration (Secs)')
ax.set_yticklabels([],minor = True)
plt.show();
Riders in the Other gender category have the longest trip durations on all days. Female riders also took longer trips than male riders on every day of the week in February. It would be interesting to know the exact cause of this.
# Creating a list of the top 10 start stations
lista = top_10_start_station.keys()
# splitting df by the three gender categories, restricted to the top 10 start stations
df_f = df[(df.member_gender == "Female") & (df.start_station_name.isin(lista))]
df_m = df[(df.member_gender == "Male") & (df.start_station_name.isin(lista))]
df_o = df[(df.member_gender == "Other") & (df.start_station_name.isin(lista))]
# Stations against gender and periods
plt.figure(figsize=[8,8])
sns.countplot(data=df_f, y='start_station_name', hue='period_day')
plt.legend(loc='center left', bbox_to_anchor=(1,0.5))
plt.title('Top 10 Trip Stations by Time of Day in Females')
plt.xlabel('Count')
plt.ylabel('Start Station Name');
plt.figure(figsize=[8,8])
sns.countplot(data=df_o, y='start_station_name', hue='period_day')
plt.legend(loc='center left', bbox_to_anchor=(1,0.5))
plt.title('Top 10 Trip Stations by Time of Day in Other gender')
plt.xlabel('Count')
plt.ylabel('Start Station Name');
plt.figure(figsize=[8,8])
sns.countplot(data=df_m, y='start_station_name', hue='period_day')
plt.legend(loc='center left', bbox_to_anchor=(1,0.5))
plt.title('Top 10 Trip Stations by Time of Day in Males')
plt.xlabel('Count')
plt.ylabel('Start Station Name');
Females took more trips in the morning than in any other period. Most of their trips started at 3 stations: Market St at 10th St, San Francisco Caltrain Station 2 and Berry St at 4th St.
For the Other gender, most trips also started in the morning, with little difference from the afternoon. The stations with the most trips were Market St at 10th St, San Francisco Caltrain Station 2 and San Francisco Caltrain Station.
Males had more trips at Market St at 10th St, Powell St BART Station and Montgomery St BART Station. Apart from the difference in the 3 most used stations, males took more trips in the afternoon than in the morning, which is probably why they take shorter trips. The location of San Francisco Caltrain Station 2, the busiest station, could also be a reason for the longer trips taken by females and other genders, as males had fewer trips from that station.
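Since males account for about 75% of all trips, raw counts make it hard to compare the genders directly. A row-normalized cross-tabulation puts each gender on the same scale by showing the share of its trips that start in each period. The sketch below uses a made-up six-row sample, so the shares are illustrative only.

```python
import pandas as pd

# Hypothetical trips: four by males, two by females
sample = pd.DataFrame({
    "member_gender": ["Male"] * 4 + ["Female"] * 2,
    "period_day": ["morning", "afternoon", "afternoon", "night",
                   "morning", "afternoon"],
})

# normalize="index" divides each row by its total, so rows sum to 1
shares = pd.crosstab(
    sample["member_gender"], sample["period_day"], normalize="index"
)
print(shares)
```

The same call restricted to the top 10 start stations would show whether the morning and afternoon patterns described above hold as proportions and not only as counts.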